Semi-Structured Data Extraction from Heterogeneous Sources
Authors
Abstract
This paper concerns the extraction of semi-structured data from Web pages generated by multiple on-line services. The task is addressed by representing the schemas of the semi-structured data and crafting generic wrappers based on those schemas. We introduce a hybrid representation method for schemas of semi-structured data, consisting of a concept hierarchy and a set of knowledge unit frames. A content-based and structure-bounded information extraction algorithm is developed to build the generic wrapper; it uses the schemas and takes advantage of the semi-structured page layouts. The main advantages of the system are that a single wrapper can be applied to multiple Web sites, and that the wrapper can handle resources with missing data and data presented in free text, which existing techniques cannot wrap. The hybrid representation has been used to write schemas for seven domains. Experiments in two domains, on-line real estate advertisements and car advertisements, show that the generic wrapper is robust to many flexible data presentations and page structures.
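As a rough illustration of the idea in the abstract, the sketch below pairs a small concept hierarchy with knowledge unit frames and applies them to ad-sized text blocks. It is not the paper's algorithm: the names `KnowledgeUnitFrame`, `CONCEPT_HIERARCHY`, `FRAMES`, `frames_for`, `split_blocks`, and `extract`, the regular expressions, and the blank-line block boundary are all assumptions made for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class KnowledgeUnitFrame:
    """Hypothetical frame: one attribute plus a pattern that recognizes its value."""
    name: str
    pattern: str  # regex with exactly one capturing group for the value


# Hypothetical concept hierarchy: child concept -> parent concept.
CONCEPT_HIERARCHY = {"house": "real_estate", "apartment": "real_estate"}

# Hypothetical frames, attached to the parent concept and inherited by children.
FRAMES = {
    "real_estate": [
        KnowledgeUnitFrame("price", r"\$\s?(\d+(?:,\d{3})*)"),
        KnowledgeUnitFrame("bedrooms", r"(\d+)\s*(?:bed(?:room)?s?|br)\b"),
    ],
}


def frames_for(concept):
    """Collect the frames of a concept and of all its ancestors."""
    frames = []
    while concept is not None:
        frames.extend(FRAMES.get(concept, []))
        concept = CONCEPT_HIERARCHY.get(concept)
    return frames


def split_blocks(page_text):
    """Assumed structure bound: each blank-line-separated chunk is one ad."""
    return [b.strip() for b in re.split(r"\n\s*\n", page_text) if b.strip()]


def extract(block, concept):
    """Content-based matching inside one block; missing values become None."""
    record = {}
    for frame in frames_for(concept):
        match = re.search(frame.pattern, block, flags=re.IGNORECASE)
        record[frame.name] = match.group(1) if match else None
    return record


page = "Lovely 3 bedroom house, $450,000\n\n2 br apartment close to station"
blocks = split_blocks(page)
records = [extract(blocks[0], "house"), extract(blocks[1], "apartment")]
```

Because the frames depend on content rather than on a site's markup, the same `extract` call can in principle be reused across sites, and an ad that omits a field (the second block has no price) still yields a record with that field set to `None`.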
Similar resources
Combining Data Integration with Natural Language Technology for the Semantic Web
Current data integration systems allow a variety of heterogeneous structured or semi-structured data sources to be combined and queried by providing an integrated view over them. The Semantic Web also requires us to be able to integrate information from a variety of heterogeneous information sources. However, these information sources will also include natural language, e.g. web pages and ontologie...
Survey on Mining in Semi-Structured Data
Emerging technologies for semi-structured data have attracted wide attention in networking, e-commerce, information retrieval, and databases. In these applications, the data are modeled not as static collections but as transient data streams, where the data source is an unbounded stream of individual data items. It is becoming increasingly popular to send heterogeneous and ill-structured data throu...
OntoExtractor: A Fuzzy-Based Approach to Content and Structure-Based Metadata Extraction
This paper describes OntoExtractor, a tool for extracting metadata from heterogeneous sources of information and producing a “quick-and-dirty” hierarchy of knowledge. The tool is specifically tailored for quick classification of semi-structured data, which makes OntoExtractor convenient for dealing with web-based data sources.
The MOMIS-STASIS approach for Ontology-Based Data Integration
Ontology-based data integration involves the use of one or more ontologies to effectively combine data and information from multiple heterogeneous sources [18]. Ontologies can be used in an integration task to describe the semantics of the information sources and to make their contents explicit. With respect to the integration of data sources, they can be used for the identification and association of seman...
An ontology-based approach for resolving semantic schema conflicts in the extraction and integration of query-based information from heterogeneous web data sources
There are many external resources and heterogeneous data on the internet that an organization or user may need to improve the decision-making process. It is therefore very important that this information is complete and precise and can be acquired on time. Most web sources provide data in semi-structured form on the internet. The combination of semi-structured data from different so...